In-class Exercise 5

Author

Lee Chee Tian

Modeling the Spatial Variation of the Explanatory Factors of Water Point Status using Geographically Weighted Logistic Regression (GWLR)

Objective

  • To build an explanatory model to discover factor affecting water point status in Osun State, Nigeria

  • Study area: Osun State

  • Data sets:

    • Osun.rds, contains LGAs boundaries of State. It is in sf polygon data frame, and

    • Osun_wp_sf.rds, contains water points within Osun State. It is also in sf polygon data frame.

Model Variables

  • Dependent variable: Water point status (ie. functional/non-functional)

  • Independent variables:

    • distance_to_primary_road

    • distance_to_secondary_road

    • distance_to_tertiary_road

    • distance_to_city

    • distance_to_town

    • water_point_population

    • local_population_1km

    • usage_capacity

    • is_urban

    • water_source_clean

Getting Started

Loading R Packages

For this in-class exercise, we shall load the following R packages:

  • Spatial data handling
    • sf, spdep
  • Attribute data handling
    • tidyverse, especially readr, ggplot2 and dplyr
  • Choropleth mapping
    • tmap
  • Multivariate data visualization and analysis
    • corrplot, ggpubr
  • EDA
    • funModeling, skimr
  • New packages introduced for this exercise
    • blorr, caret
pacman::p_load(sf, tidyverse, funModeling, blorr, corrplot, ggpubr, 
               spdep, GWmodel, tmap, skimr, caret)

Importing Analytical Data into R

In this in-class exercise, 2 data sets will be used:

Osun <- read_rds("data/rds/Osun.rds")
Osun_wp_sf <- read_rds("data/rds/Osun_wp_sf.rds")
Osun_wp_sf %>%
  freq(input = 'status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0

This state (Osun) is selected for this exercise because of its high percentage of non-functional water points.

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun) +
  tm_polygons(alpha = 0.4) +
tm_shape(Osun_wp_sf) +
  tm_dots(col = "status", alpha = 0.6) +
  tm_view(set.zoom.limits = c(8, 14))

Exploratory Data Analysis

Summary statistics with skimr:

Osun_wp_sf %>%
  skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

Those variables with high number of missing values are omitted, eg. install_year.

Filtering out those variables with missing values and retaining those that are more complete:

Osun_wp_sf_clean <- Osun_wp_sf %>%
  filter_at(vars(status,
                 distance_to_primary_road,
                 distance_to_secondary_road,
                 distance_to_tertiary_road,
                 distance_to_city,
                 distance_to_town,
                 water_point_population,
                 local_population_1km,
                 usage_capacity,
                 is_urban,
                 water_source_clean),
            all_vars(!is.na(.))) %>%
  mutate(usage_capacity = as.factor(usage_capacity))

Correlation Analysis

Osun_wp <- Osun_wp_sf_clean %>%
  select(c(7, 35:39, 42:43, 46:47, 57)) %>%
  st_set_geometry(NULL)
cluster_vars.cor = cor(Osun_wp[, 2:7])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

There is no sign of multicollinearity as none of the correlation coefficients is > 0.8.

Building a Logistic Regression Model

model <- glm(status ~ 
               distance_to_primary_road +
               distance_to_secondary_road +
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = "logit"))

Using blr_regress function of blorr package to visualize the model:

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

A report is generated to present the model.

It is observed that the variables distance_to_primary_road and distance_to_secondary_road have p-values that are larger than 0.05, which implies that these variables are statistically insignificant, hence they will be dropped from subsequent analysis.

Reconstructing the model with the 2 non-significant variables removed:

model <- glm(status ~ 
               distance_to_tertiary_road +
               distance_to_city +
               distance_to_town +
               is_urban +
               usage_capacity +
               water_source_clean +
               water_point_population +
               local_population_1km,
             data = Osun_wp_sf_clean,
             family = binomial(link = "logit"))

Interpreting and checking the performance of the model:

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

Building Geographically Weighted Logistic Regression (GWLR) Model

Converting from sf to sp data frame

Osun_wp_sp <- Osun_wp_sf_clean %>%
  select(c(status,
           distance_to_tertiary_road,
           distance_to_city,
           distance_to_town,
           water_point_population,
           local_population_1km,
           usage_capacity,
           is_urban,
           water_source_clean)) %>%
  as_Spatial()
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 9
names       : status, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, usage_capacity, is_urban, water_source_clean 
min values  :      0,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,           1000,        0,           Borehole 
max values  :      1,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,            300,        1,   Protected Spring 

Building Fixed Bandwidth GWR Model

Computing fixed bandwidth

bw.fixed <- bw.ggwr(status ~ 
                      distance_to_tertiary_road +
                      distance_to_city +
                      distance_to_town +
                      is_urban +
                      usage_capacity +
                      water_source_clean +
                      water_point_population +
                      local_population_1km,
               data = Osun_wp_sp,
               family = "binomial",
               approach = "AIC",
               kernel = "gaussian",
               adaptive = FALSE,
               longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2890 
       1        -2837 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5681.18 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2878 
       1        -2820 
       2        -2812 
       3        -2810 
       4        -2810 
       5        -2810 
Fixed bandwidth: 59200.13 AICc value: 5645.901 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2854 
       1        -2790 
       2        -2777 
       3        -2774 
       4        -2774 
       5        -2774 
       6        -2774 
Fixed bandwidth: 36599.53 AICc value: 5585.354 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2810 
       1        -2732 
       2        -2711 
       3        -2707 
       4        -2707 
       5        -2707 
       6        -2707 
Fixed bandwidth: 22631.59 AICc value: 5481.877 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2732 
       1        -2635 
       2        -2604 
       3        -2597 
       4        -2596 
       5        -2596 
       6        -2596 
Fixed bandwidth: 13998.93 AICc value: 5333.718 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2624 
       1        -2502 
       2        -2459 
       3        -2447 
       4        -2446 
       5        -2446 
       6        -2446 
       7        -2446 
Fixed bandwidth: 8663.649 AICc value: 5178.493 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2478 
       1        -2319 
       2        -2250 
       3        -2225 
       4        -2219 
       5        -2219 
       6        -2220 
       7        -2220 
       8        -2220 
       9        -2220 
Fixed bandwidth: 5366.266 AICc value: 5022.016 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2222 
       1        -2002 
       2        -1894 
       3        -1838 
       4        -1818 
       5        -1814 
       6        -1814 
Fixed bandwidth: 3328.371 AICc value: 4827.587 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1837 
       1        -1528 
       2        -1357 
       3        -1261 
       4        -1222 
       5        -1222 
Fixed bandwidth: 2068.882 AICc value: 4772.046 
 Iteration    Log-Likelihood:(With bandwidth:  1290.476 )
=========================
       0        -1403 
       1        -1016 
       2       -807.3 
       3       -680.2 
       4       -680.2 
Fixed bandwidth: 1290.476 AICc value: 5809.719 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -2019 
       1        -1753 
       2        -1614 
       3        -1538 
       4        -1506 
       5        -1506 
Fixed bandwidth: 2549.964 AICc value: 4764.056 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2108 
       1        -1862 
       2        -1736 
       3        -1670 
       4        -1644 
       5        -1644 
Fixed bandwidth: 2847.289 AICc value: 4791.834 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1955 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2366.207 AICc value: 4755.524 
 Iteration    Log-Likelihood:(With bandwidth:  2252.639 )
=========================
       0        -1913 
       1        -1623 
       2        -1465 
       3        -1376 
       4        -1341 
       5        -1341 
Fixed bandwidth: 2252.639 AICc value: 4759.188 
 Iteration    Log-Likelihood:(With bandwidth:  2436.396 )
=========================
       0        -1980 
       1        -1706 
       2        -1560 
       3        -1479 
       4        -1446 
       5        -1446 
Fixed bandwidth: 2436.396 AICc value: 4756.675 
 Iteration    Log-Likelihood:(With bandwidth:  2322.828 )
=========================
       0        -1940 
       1        -1656 
       2        -1503 
       3        -1417 
       4        -1382 
       5        -1382 
Fixed bandwidth: 2322.828 AICc value: 4756.471 
 Iteration    Log-Likelihood:(With bandwidth:  2393.017 )
=========================
       0        -1965 
       1        -1687 
       2        -1539 
       3        -1456 
       4        -1422 
       5        -1422 
Fixed bandwidth: 2393.017 AICc value: 4755.57 
 Iteration    Log-Likelihood:(With bandwidth:  2349.638 )
=========================
       0        -1949 
       1        -1668 
       2        -1517 
       3        -1432 
       4        -1398 
       5        -1398 
Fixed bandwidth: 2349.638 AICc value: 4755.753 
 Iteration    Log-Likelihood:(With bandwidth:  2376.448 )
=========================
       0        -1959 
       1        -1680 
       2        -1530 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2376.448 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2382.777 )
=========================
       0        -1961 
       1        -1683 
       2        -1534 
       3        -1450 
       4        -1416 
       5        -1416 
Fixed bandwidth: 2382.777 AICc value: 4755.491 
 Iteration    Log-Likelihood:(With bandwidth:  2372.536 )
=========================
       0        -1958 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1411 
       5        -1411 
Fixed bandwidth: 2372.536 AICc value: 4755.488 
 Iteration    Log-Likelihood:(With bandwidth:  2378.865 )
=========================
       0        -1960 
       1        -1681 
       2        -1532 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2378.865 AICc value: 4755.481 
 Iteration    Log-Likelihood:(With bandwidth:  2374.954 )
=========================
       0        -1959 
       1        -1679 
       2        -1530 
       3        -1446 
       4        -1412 
       5        -1412 
Fixed bandwidth: 2374.954 AICc value: 4755.482 
 Iteration    Log-Likelihood:(With bandwidth:  2377.371 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.371 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.942 )
=========================
       0        -1960 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2377.942 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.018 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.018 AICc value: 4755.48 
bw.fixed
[1] 2377.371
gwlr.fixed <- ggwr.basic(status ~ 
                           distance_to_tertiary_road +
                           distance_to_city +
                           distance_to_town +
                           is_urban +
                           usage_capacity +
                           water_source_clean +
                           water_point_population +
                           local_population_1km,
                         data = Osun_wp_sp,
                         bw = bw.fixed,
                         family = "binomial",
                         kernel = "gaussian",
                         adaptive = FALSE,
                         longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-17 12:11:23 
   Call:
   ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city + 
    distance_to_town + is_urban + usage_capacity + water_source_clean + 
    water_point_population + local_population_1km, data = Osun_wp_sp, 
    bw = bw.fixed, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-129.368    -1.750     1.074     1.742    34.126  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.540e-01  1.055e-01   3.354 0.000796
distance_to_tertiary_road                 1.001e-04  2.040e-05   4.910 9.13e-07
distance_to_city                         -1.764e-05  3.391e-06  -5.202 1.97e-07
distance_to_town                         -1.544e-05  2.825e-06  -5.466 4.60e-08
is_urbanTRUE                             -2.667e-01  7.474e-02  -3.569 0.000358
usage_capacity1000                       -6.206e-01  6.966e-02  -8.908  < 2e-16
water_source_cleanProtected Shallow Well  4.947e-01  8.496e-02   5.823 5.79e-09
water_source_cleanProtected Spring        1.279e+00  4.384e-01   2.917 0.003530
water_point_population                   -5.098e-04  4.476e-05 -11.390  < 2e-16
local_population_1km                      3.452e-04  1.779e-05  19.407  < 2e-16
                                            
Intercept                                ***
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
water_point_population                   ***
local_population_1km                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.9  on 4746  degrees of freedom
AIC: 5708.9

Number of Fisher Scoring iterations: 5


 AICc:  5708.923
 Pseudo R-square value:  0.129406
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2377.371 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -3.7021e+02 -4.3797e+00  3.5590e+00
   distance_to_tertiary_road                -3.1622e-02 -4.5462e-04  9.1291e-05
   distance_to_city                         -5.4555e-02 -6.5623e-04 -1.3507e-04
   distance_to_town                         -8.6549e-03 -5.2754e-04 -1.6785e-04
   is_urbanTRUE                             -7.3554e+02 -3.4675e+00 -1.6596e+00
   usage_capacity1000                       -5.5889e+01 -1.0347e+00 -4.1960e-01
   water_source_cleanProtected.Shallow.Well -1.8842e+02 -4.7295e-01  6.2378e-01
   water_source_cleanProtected.Spring       -1.3630e+03 -5.3436e+00  2.7714e+00
   water_point_population                   -2.9696e-02 -2.2705e-03 -1.2277e-03
   local_population_1km                     -7.7730e-02  4.4281e-04  1.0548e-03
                                                3rd Qu.      Max.
   Intercept                                 1.3755e+01 2171.6373
   distance_to_tertiary_road                 6.3011e-04    0.0237
   distance_to_city                          1.5921e-04    0.0162
   distance_to_town                          2.4490e-04    0.0179
   is_urbanTRUE                              1.0554e+00  995.1840
   usage_capacity1000                        3.9113e-01    9.2449
   water_source_cleanProtected.Shallow.Well  1.9564e+00   66.8914
   water_source_cleanProtected.Spring        7.0805e+00  208.3749
   water_point_population                    4.5879e-04    0.0765
   local_population_1km                      1.8479e-03    0.0333
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2815.659 
   AIC : 4418.776 
   AICc : 4744.213 
   Pseudo R-square value:  0.5691072 

   ***********************************************************************
   Program stops at: 2022-12-17 12:11:55 

The AIC value of the GWR model (4418.776) is less than that of the Global model (5708.9), implying an improvement in the GWR model over the Global model.

Model Assessment

Converting SDF into sf data.frame

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will label yhat values greater or equal to 0.5 into 1 else 0.

gwr.fixed <- gwr.fixed %>%
  mutate(most = ifelse(
    gwr.fixed$yhat >= 0.5, T, F))
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7661          
                                          
 Mcnemar's Test P-Value : 0.6085          
                                          
            Sensitivity : 0.8671          
            Specificity : 0.8986          
         Pos Pred Value : 0.8724          
         Neg Pred Value : 0.8942          
             Prevalence : 0.4445          
         Detection Rate : 0.3854          
   Detection Prevalence : 0.4418          
      Balanced Accuracy : 0.8828          
                                          
       'Positive' Class : FALSE           
                                          

The accuracy of this model is increased significantly to 0.8846. This proves that geographically weighted model is greatly improved. Especially the specificity value, which shows that this model is better at predicting the non-functional water points.

Visualizing GWLR

Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
  select(c(ADM2_EN, ADM2_PCODE,
           ADM1_EN, ADM1_PCODE,
           status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)

Visualizing coefficient estimates

tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
  tm_dots(col = "yhat",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8, 14))
prob_T
tertiary_TV <- tm_shape(Osun) +
  tm_polygons(alpha = 0.1) +
  tm_shape(gwr_sf.fixed) +
  tm_dots(col = "distance_to_tertiary_road_TV", 
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(8, 14))
tmap_arrange(prob_T, tertiary_TV, asp = 1, ncol = 2, sync = TRUE)
Variable(s) "distance_to_tertiary_road_TV" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.